Safe Policy Iteration

Authors

  • Matteo Pirotta
  • Marcello Restelli
  • Alessio Pecorino
  • Daniele Calandriello
Abstract

CONTRIBUTIONS

1. Theoretical contribution. We introduce a new, more general lower bound on the improvement of an arbitrary policy over another policy, based on the ability to bound the distance between the future state distributions of the two policies.
2. Algorithmic contribution. We define two approximate policy-iteration algorithms whose policy-improvement step moves toward the estimated greedy policy by maximizing the policy-improvement bound (see the sketch of this bound-maximizing update after the abstract).
3. Empirical contribution. We report results on a simple chain-walk domain and on Blackjack that confirm the main theoretical findings.

PROBLEM

• Classical API approaches may generate a policy π_{t+1} that performs worse than the previous policy π_t.
• This undesired degradation can lead to the policy-oscillation phenomenon, which may prevent convergence to the optimal policy and harm the learning process.
• Our "safe" approach overcomes this issue by visiting a sequence of policies with monotonically improving performance. The policy is constrained to improve over time and, as a consequence, degradation of policy performance between consecutive iterations is prevented.
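To make the second contribution concrete, here is a minimal sketch of a bound-maximizing improvement step. It assumes the lower bound takes a quadratic form B(α) = aα − bα² in the mixing coefficient α of the update that blends the current policy with the (estimated) greedy policy π̄; the exact coefficients in the paper differ, so the penalty term b used below is an illustrative stand-in, not the paper's constant:

```latex
\pi_{t+1} = \alpha\,\bar{\pi} + (1-\alpha)\,\pi_t,
\qquad
J_{\pi_{t+1}} - J_{\pi_t} \;\ge\; a\alpha - b\alpha^2,
\qquad
\alpha^{\star} = \min\!\left(1,\; \frac{a}{2b}\right).
```

Under these assumptions, one safe improvement step on a tabular MDP might look like the following (array shapes, names, and the form of b are all assumptions for illustration):

```python
import numpy as np

# Minimal sketch of one safe policy-improvement step on a tabular MDP.
# P has shape (nS, nA, nS), R has shape (nS, nA), pi has shape (nS, nA),
# mu is the initial-state distribution. The penalty coefficient b is an
# illustrative assumption, not the paper's exact constant.

def policy_eval(P, R, pi, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    nS = R.shape[0]
    P_pi = np.einsum("sa,san->sn", pi, P)   # transition matrix under pi
    R_pi = np.einsum("sa,sa->s", pi, R)     # expected reward under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
    Q = R + gamma * np.einsum("san,n->sa", P, V)
    return V, Q

def safe_improvement_step(P, R, pi, gamma, mu):
    nS, nA = R.shape
    V, Q = policy_eval(P, R, pi, gamma)
    A = Q - V[:, None]                       # advantage A^pi(s, a)
    pi_bar = np.eye(nA)[A.argmax(axis=1)]    # greedy target policy (one-hot)
    # Discounted future state distribution d^pi_mu.
    P_pi = np.einsum("sa,san->sn", pi, P)
    d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, mu)
    # a: expected advantage of pi_bar under d^pi_mu, scaled by the horizon.
    a = d @ np.einsum("sa,sa->s", pi_bar, A) / (1 - gamma)
    # b: crude penalty scaling with the advantage range and the policy
    # distance sup_s ||pi_bar(.|s) - pi(.|s)||_1 (assumed form).
    dist = np.abs(pi_bar - pi).sum(axis=1).max()
    b = gamma * (A.max() - A.min()) * dist**2 / (2 * (1 - gamma) ** 2)
    if a <= 0:                               # pi is already greedy
        return pi, 0.0
    alpha = 1.0 if b <= 0 else min(1.0, a / (2 * b))
    return alpha * pi_bar + (1 - alpha) * pi, alpha
```

Whenever a > 0 and α* < 1, the assumed bound evaluates to B(α*) = a²/(4b) > 0, so each step carries a guaranteed lower-bounded improvement. Iterating the step therefore yields the monotonically improving sequence of policies described in the abstract, as long as the assumed bound is valid.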


Similar articles

Policy Iteration in Finite Templates Domain

We prove in this paper that policy iteration can be generally defined over a finite domain of templates using Lagrange duality. Such a policy iteration algorithm converges to a fixed point when a very simple technical condition holds. This fixed point furnishes a safe over-approximation of the set of reachable values taken by the variables of a program. We also prove that policy iteration can be ea...


Nonconvex Policy Search Using Variational Inequalities

Policy search is a class of reinforcement learning algorithms for finding optimal policies in control problems with limited feedback. These methods have been shown to be successful in high-dimensional problems such as robotics control. Though successful, current methods can lead to unsafe policy parameters that could potentially damage hardware units. Motivated by such constraints, we propose p...


Parallel Optimization of Motion Controllers via Policy Iteration

This paper describes a policy iteration algorithm for optimizing the performance of a harmonic function-based controller with respect to a user-defined index. Value functions are represented as potential distributions over the problem domain, while control policies are represented as gradient fields over the same domain. All intermediate policies are intrinsically safe, i.e. collisions are not prom...


On Controlled Markov Chains with Optimality Requirement and Safety Constraint

We study the control of completely observed Markov chains subject to generalized safety bounds and an optimality requirement. Originally, the safety bounds were specified as unit-interval-valued vector pairs (lower and upper bounds for each component of the state probability distribution). In this paper, we generalize the constraint to be any linear convex set for the distribution to stay in, and ...


Nested Value Iteration for Partially Satisfiable Co-Safe LTL Specifications

Overview: We describe our recent work (Lacerda, Parker, and Hawes 2015) on cost-optimal policy generation for co-safe linear temporal logic (LTL) specifications that are not satisfiable with probability one in a Markov decision process (MDP) model. We provide an overview of the approach to pose the problem as the optimisation of three standard objectives in a trimmed product MDP. Furthermore, we ...



Journal:

Volume   Issue

Pages  -

Publication year: 2013